ViralMSA: massively scalable reference

Abstract

Motivation

In molecular epidemiology, the identification of clusters of transmissions typically requires the alignment of viral genomic sequence data. However, existing methods of multiple sequence alignment (MSA) scale poorly with respect to the number of sequences.

Results

ViralMSA is a user-friendly reference-guided MSA tool that leverages the algorithmic techniques of read mappers to enable the MSA of ultra-large viral genome datasets. It scales linearly with the number of sequences, and it is able to align tens of thousands of full viral genomes in seconds. However, alignments produced by ViralMSA omit insertions with respect to the reference genome.

Availability and implementation

ViralMSA is freely available at https://github.com/niemasd/ViralMSA as an open-source software project.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Real-time or near real-time surveillance of the spread of a pathogen can provide actionable information for public health response (Poon et al., 2016). Though there is currently no consensus in the world of molecular epidemiology regarding a formal definition of what exactly constitutes a ‘transmission cluster’ (Novitsky et al., 2017), all current methods of inferring transmission clusters require a multiple sequence alignment (MSA) of the viral genomes: distance-based methods of transmission clustering require knowledge of homology for accurate distance measurement (Pond et al., 2018), and phylogenetic methods of transmission clustering require the MSA as a precursor to phylogenetic inference (Balaban et al., 2019; Prosperi et al., 2011; Ragonnet-Cronin et al., 2013).

The standard tools for performing MSA such as MAFFT (Katoh and Standley, 2013), MUSCLE (Edgar, 2004), and Clustal Omega (Sievers and Higgins, 2014) are prohibitively slow for real-time pathogen surveillance as the number of viral genomes grows. For example, during the COVID-19 pandemic, the number of viral genome assemblies available from around the world grew exponentially in the initial months of the pandemic, but MAFFT, the fastest of the aforementioned MSA tools, scales quadratically with respect to the number of sequences.

In the case of closely-related viral sequences for which a high-confidence reference genome exists, MSA can be accelerated by independently comparing each viral genome in the dataset against the reference genome and then using the reference as an anchor to merge the individual alignments into a single MSA (Pond et al., 2018).

Here, we introduce ViralMSA, a user-friendly open-source MSA tool that utilizes read mappers such as Minimap2 (Li, 2018) to enable the reference-guided alignment of ultra-large viral whole-genome datasets.

2 Related work

VIRULIGN is another reference-guided MSA tool designed for viruses (Libin et al., 2019). While VIRULIGN also aims to support MSA of large sequence datasets, its primary objective is to produce codon-correct MSAs (i.e. MSAs that avoid frameshifts), making it appropriate for aligning coding regions, whereas ViralMSA’s primary objective is to align whole viral genomes in real time. Further, ViralMSA is orders of magnitude faster than VIRULIGN (Fig. 1) and uses a fraction of the memory.

Fig. 1. Execution time. Execution time for SARS-CoV-2 MSAs (genome length 29 kb) estimated by VIRULIGN, MAFFT, and ViralMSA for various dataset sizes. All runs were executed sequentially on an 8-core 2.0 GHz Intel Xeon CPU with 30 GB of memory.

3 Results and discussion

ViralMSA is written in Python 3 and is thus cross-platform. ViralMSA depends on BioPython (Cock et al., 2009) and whichever read mapper the user chooses, which is Minimap2 by default (Li, 2018). In addition to Minimap2, ViralMSA supports STAR (Dobin et al., 2013), Bowtie 2 (Langmead and Salzberg, 2012) and HISAT2 (Kim et al., 2019), though the default of Minimap2 is strongly recommended: Minimap2 is much faster than the others (Li, 2018) and is the only mapper that consistently succeeds in aligning all genome assemblies against an appropriate reference across multiple viruses. ViralMSA supports read mappers other than Minimap2 primarily to demonstrate its flexibility: incorporating new read mappers in the future will be simple.

ViralMSA takes the following as input: (i) a FASTA file containing the viral genomes to align, (ii) the GenBank accession number of the reference genome, and (iii) the mapper to utilize (Minimap2 by default). ViralMSA will pull the reference genome from GenBank and generate an index using the selected mapper, both of which will be cached for future alignments of the same viral strain, and will then execute the mapping. ViralMSA will then process the results and output an MSA in the FASTA format. For commonly studied viruses, the user can simply provide the name of the virus instead of an accession number, and ViralMSA will select an appropriate reference genome. The user can also choose to provide a local FASTA file containing a reference genome, which may be useful if the desired reference does not exist on GenBank or if the user wishes to conduct the analysis offline.
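The index-once, map-many workflow described above can be sketched in Python as follows. This is an illustration under stated assumptions, not ViralMSA's actual code: the cache layout and the function names (`cache_paths`, `minimap2_index_cmd`, `minimap2_map_cmd`, `map_genomes`) are hypothetical, the reference FASTA is assumed to have been downloaded into the cache already, and the exact Minimap2 options that ViralMSA passes may differ. Only standard Minimap2 flags are used (`-d` to write an index, `-a` for SAM output, `-t` for threads, `-o` for the output file).

```python
import subprocess
from pathlib import Path


def cache_paths(cache_dir, accession):
    """Hypothetical cache layout: one folder per reference accession."""
    ref_dir = Path(cache_dir) / accession
    return ref_dir / f"{accession}.fas", ref_dir / f"{accession}.mmi"


def minimap2_index_cmd(ref_fasta, index_path):
    # `minimap2 -d <index> <reference>` writes a reusable index file.
    return ["minimap2", "-d", str(index_path), str(ref_fasta)]


def minimap2_map_cmd(index_path, genomes_fasta, sam_out, threads=1):
    # `-a` requests SAM output, `-t` sets the thread count,
    # and `-o` names the output file.
    return ["minimap2", "-a", "-t", str(threads), "-o", str(sam_out),
            str(index_path), str(genomes_fasta)]


def map_genomes(cache_dir, accession, genomes_fasta, sam_out, threads=1):
    """Map every genome in genomes_fasta against a cached reference."""
    ref_fasta, index_path = cache_paths(cache_dir, accession)
    if not ref_fasta.exists():
        # Downloading the reference is outside the scope of this sketch.
        raise FileNotFoundError(f"reference not cached yet: {ref_fasta}")
    if not index_path.exists():
        # Build the index only on the first run for this reference.
        subprocess.run(minimap2_index_cmd(ref_fasta, index_path), check=True)
    subprocess.run(
        minimap2_map_cmd(index_path, genomes_fasta, sam_out, threads),
        check=True,
    )
```

Because the reference and its index are looked up by accession, a second call to `map_genomes` for the same virus skips indexing and goes straight to mapping.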

Because it uses the positions of the reference genome as anchors with which to merge the individual pairwise alignments, ViralMSA only keeps matches, mismatches, and deletions with respect to the reference genome: it discards all insertions with respect to the reference genome. For closely-related viral strains, insertions with respect to the reference genome are typically unique and thus lack usable phylogenetic or transmission clustering information, so their removal results in little to no impact on downstream analyses (Table 1).
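To make the anchoring step concrete, the following minimal Python sketch projects one pairwise alignment onto reference coordinates. It assumes that the mapper reports each alignment as a SAM-style 1-based start position and CIGAR string; the function name `project_onto_reference` is hypothetical, and the code is an illustration of the idea rather than ViralMSA's implementation.

```python
import re

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def project_onto_reference(ref_len, pos, cigar, seq):
    """Return seq as a gapped row of exactly ref_len columns.

    pos is the 1-based leftmost reference position, as in SAM.
    """
    row = ["-"] * ref_len
    ref_i = pos - 1  # 0-based index into the reference
    seq_i = 0        # 0-based index into the query sequence
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":
            # Match or mismatch: copy query bases into reference columns.
            row[ref_i:ref_i + n] = list(seq[seq_i:seq_i + n])
            ref_i += n
            seq_i += n
        elif op in "DN":
            # Deletion: keep the gap characters and advance the reference.
            ref_i += n
        elif op in "IS":
            # Insertion or soft clip: discard the query bases.
            seq_i += n
        # H and P consume neither the reference nor the query.
    return "".join(row)


# Toy reference of length 8 and three mapped genomes.
rows = [
    project_onto_reference(8, 1, "8M", "ACGTACGT"),
    project_onto_reference(8, 2, "2M1I3M", "CGTTAC"),  # insertion dropped
    project_onto_reference(8, 1, "3M2D3M", "ACGCGT"),  # deletion kept as gaps
]
```

Every row has the same length as the reference, so stacking the rows yields an MSA without any further pairwise comparison between genomes. The work per genome is independent of the number of other genomes, which is the source of the linear scaling.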

Table 1. MSA accuracy

Virus    MAFFT (S)    ViralMSA (S)    MAFFT (P)    ViralMSA (P)
Ebola    0.9957       0.9873          0.9998       0.9816
HCV      0.9995       0.9506          0.9999       0.9678
HIV-1    0.9786       0.9705          0.9957       0.9941

Correlation coefficients are shown for Mantel tests between curated ‘ground truth’ MSAs and those estimated by MAFFT and ViralMSA. S and P denote Spearman and Pearson correlation, respectively. 1 indicates perfect correlation, −1 indicates perfect anticorrelation, and 0 indicates no correlation.
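For readers unfamiliar with the Mantel test, the following pure-Python sketch shows how such a correlation can be computed. It assumes that each MSA has already been reduced to a symmetric pairwise distance matrix over the same sequences in the same order; the distance model used to build those matrices is not specified in this section. All function names are hypothetical, and the permutation p-value is included only for completeness (Table 1 reports the correlation coefficients).

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Average 1-based ranks, so tied values share a rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    # Spearman correlation is Pearson correlation on the ranks.
    return pearson(ranks(x), ranks(y))


def mantel(d1, d2, corr=pearson, permutations=999, seed=0):
    """Return (r, p): matrix correlation and a one-sided permutation p-value."""
    n = len(d1)
    x = upper_triangle(d1)
    r_obs = corr(x, upper_triangle(d2))
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        # Relabel the sequences of d2 (rows and columns together).
        perm = list(range(n))
        rng.shuffle(perm)
        shuffled = [[d2[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
        if corr(x, upper_triangle(shuffled)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)
```

Calling `mantel(truth, estimate, corr=spearman)` and `mantel(truth, estimate, corr=pearson)` on a pair of distance matrices gives the two kinds of coefficient reported in the S and P columns.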

In order to assess MSA estimation accuracy, we obtained curated Ebola, HCV, and HIV-1 full-genome MSAs from the Los Alamos National Laboratory (LANL) sequence databases, which we used as our ground truth. In order to benchmark MSA runtime, we obtained a large collection of SARS-CoV-2 complete genomes from the Global Initiative on Sharing All Influenza Data (GISAID) database. VIRULIGN crashed when run on all datasets aside from the SARS-CoV-2 dataset.

To measure performance, we subsampled the full SARS-CoV-2 dataset, with 10 replicates for each dataset size, and then computed MSAs of each replicate. ViralMSA is consistently orders of magnitude faster than both MAFFT and VIRULIGN (Fig. 1 and Supplementary Fig. S1). Further, for all SARS-CoV-2 datasets, both ViralMSA and MAFFT required

References

Dobin A. et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29, 15–21.

Edgar R.C. (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinform., 5, 113.

Katoh K., Standley D.M. (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol., 30, 772–780.

Kim D. et al. (2019) Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol., 37, 907–915.

Langmead B., Salzberg S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 357–359.

Li H. (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34, 3094–3100.

Libin P.J.K. et al. (2019) VIRULIGN: fast codon-correct alignment and annotation of viral genomes. Bioinformatics, 35, 1763–1765.

Novitsky V. et al. (2017) Phylogenetic inference of HIV transmission clusters. Infect. Dis. Transl. Med., 3, 51–59.

Piñeiro C. et al. (2020) VeryFastTree: speeding up the estimation of phylogenies for large alignments through parallelization and vectorization strategies. Bioinformatics, 36, 4658–4659.

Pond S.L.K. et al. (2018) HIV-TRACE (TRAnsmission Cluster Engine): a tool for large scale molecular epidemiology of HIV-1 and other rapidly evolving pathogens. Mol. Biol. Evol., 35, 1812–1819.

Poon A.F.Y. et al. (2016) Near real-time monitoring of HIV transmission hotspots from routine HIV genotyping: an implementation case study. Lancet HIV, 3, e231–e238.

Prosperi M.C.F. et al. (2011) A novel methodology for large-scale phylogeny partition. Nat. Commun., 2, 321.

Ragonnet-Cronin M. et al. (2013) Automated analysis of phylogenetic clusters. BMC Bioinform., 14, 317.

Robinson D.F., Foulds L.R. (1981) Comparison of phylogenetic trees. Math. Biosci., 53, 131–147.

Sievers F., Higgins D.G. (2014) Clustal Omega, accurate alignment of very large numbers of sequences. Methods Mol. Biol., 1079, 105–116.

Tamura K., Nei M. (1993) Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol., 10, 512–526.

Tavaré S. (1986) Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures Math. Life Sci., 17, 57–86.

© The Author(s) 2020. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected]. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

